{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DeepCoy\n",
    "\n",
    "This notebook serves as a simple workflow for generating and selecting decoy molecules using DeepCoy. \n",
    "\n",
    "Feel free to adapt and alter the workflow presented to the needs of your project. The workflow adopt closely follows the methods described in our manuscript, [Generating Property-Matched Decoys Using Deep Learning](https://www.biorxiv.org/content/10.1101/2020.08.26.268193v1). \n",
    "\n",
    "Any questions, comments or feedback, please email imrie@stats.ox.ac.uk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"../\")\n",
    "sys.path.append(\"../evaluation/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from rdkit import Chem\n",
    "from rdkit.Chem import AllChem\n",
    "from rdkit.Chem import Draw\n",
    "from rdkit.Chem.Draw import IPythonConsole\n",
    "from rdkit.Chem.Draw import MolDrawing, DrawingOptions\n",
    "from rdkit.Chem import MolStandardize\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "from itertools import product\n",
    "from joblib import Parallel, delayed\n",
    "import re\n",
    "from collections import defaultdict\n",
    "\n",
    "from IPython.display import clear_output\n",
    "IPythonConsole.ipython_useSVG = True\n",
    "\n",
    "from DeepCoy import DenseGGNNChemModel\n",
    "from data.prepare_data import read_file, preprocess\n",
    "from select_and_evaluate_decoys import select_and_evaluate_decoys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Whether to use GPU for generating molecules with DeLinker\n",
    "use_gpu = True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocess actives data\n",
    "\n",
    "We first need to preprocess the data used by DeepCoy. The active molecules should be supplied in a text file, with one entry on each line. \n",
    "\n",
    "E.g.\n",
    "```\n",
    "c1(ccc(cc1)c1c(ocn1)c1ccc2n(c1)c(nn2)c1c(cccc1)OC)F\n",
    "c1c(c(cc(c1)C(=O)NOC)Nc1c2c(c(cn2ncn1)C(=O)NCc1ccccc1)C)C\n",
    "```\n",
    "\n",
    "There should be no other entries on a line other than the SMILES string of the molecule to generate decoys for."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this example, we will use the actives for [DEKOIS 2.0 target P38-alpha](http://www.dekois.com/). For speed purposes, we will only use the first 10 actives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Finished reading: 10 / 10\n",
      "Parsing smiles as graphs.\n",
      "Processed: 10 / 10\n",
      "Saving data.\n",
      "Length raw data: \t10\n",
      "Length processed data: \t10\n"
     ]
    }
   ],
   "source": [
    "data_path = './P38-alpha_actives.smi'\n",
    "\n",
    "raw_data = read_file(data_path)\n",
    "preprocess(raw_data, \"zinc\", \"P38-alpha_actives\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load DeepCoy model and generate decoys\n",
    "\n",
    "Let's now setup and generate candidate decoys with DeepCoy. The below settings generate 100 candidate decoys for each active molecule (note in our paper we generated 1000 candidates per active).\n",
    "\n",
    "It should take <10 minutes using a GPU and around 10-15 minutes using CPU only to generate 1000 candidate decoys on a consumer laptop. The exact time will vary depending on your hardware. The generation process is fully parallelisable if you need to generate large numbers of decoys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "if not use_gpu:\n",
    "    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'\n",
    "else:\n",
    "    os.environ['CUDA_VISIBLE_DEVICES'] = '0'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Arguments for DeepCoy\n",
    "args = defaultdict(None)\n",
    "args['--dataset'] = 'zinc'\n",
    "args['--config'] = '{\"generation\": true, \\\n",
    "                     \"batch_size\": 1, \\\n",
    "                     \"number_of_generation_per_valid\": 100, \\\n",
    "                     \"train_file\": \"molecules_P38-alpha_actives.json\", \\\n",
    "                     \"valid_file\": \"molecules_P38-alpha_actives.json\", \\\n",
    "                     \"output_name\": \"P38-alpha_example_decoys.smi\", \\\n",
    "                     \"use_subgraph_freqs\": false}'\n",
    "args['--freeze-graph-model'] = False\n",
    "args['--restore'] = '../models/DeepCoy_DUDE_model_e09.pickle'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Run 2021-01-18-13-15-29_16228 starting with following parameters:\n",
      "{\"task_sample_ratios\": {}, \"use_edge_bias\": true, \"clamp_gradient_norm\": 1.0, \"out_layer_dropout_keep_prob\": 1.0, \"tie_fwd_bkwd\": true, \"random_seed\": 0, \"batch_size\": 1, \"num_epochs\": 10, \"epoch_to_generate\": 10, \"number_of_generation_per_valid\": 100, \"maximum_distance\": 50, \"use_argmax_generation\": false, \"residual_connection_on\": true, \"residual_connections\": {\"2\": [0], \"4\": [0, 2], \"6\": [0, 2, 4], \"8\": [0, 2, 4, 6], \"10\": [0, 2, 4, 6, 8], \"12\": [0, 2, 4, 6, 8, 10], \"14\": [0, 2, 4, 6, 8, 10, 12]}, \"num_timesteps\": 7, \"hidden_size\": 100, \"encoding_size\": 8, \"kl_trade_off_lambda\": 0.3, \"learning_rate\": 0.001, \"graph_state_dropout_keep_prob\": 1, \"compensate_num\": 0, \"train_file\": \"molecules_P38-alpha_actives.json\", \"valid_file\": \"molecules_P38-alpha_actives.json\", \"try_different_starting\": true, \"num_different_starting\": 1, \"generation\": true, \"use_graph\": true, \"label_one_hot\": false, \"multi_bfs_path\": false, \"bfs_path_count\": 30, \"path_random_order\": false, \"sample_transition\": false, \"edge_weight_dropout_keep_prob\": 1, \"check_overlap_edge\": false, \"truncate_distance\": 10, \"output_name\": \"P38-alpha_example_decoys.smi\", \"subgraph_freq_file\": \"\", \"use_subgraph_freqs\": false, \"dataset\": \"zinc\", \"num_symbols\": 14}\n",
      "Loading data from molecules_P38-alpha_actives.json\n",
      "Loading data from molecules_P38-alpha_actives.json\n",
      "Restoring weights from file ../models/DeepCoy_DUDE_model_e09.pickle.\n",
      "Generated mol 0\n",
      "Generated mol 100\n",
      "Generated mol 200\n",
      "Generated mol 300\n",
      "Generated mol 400\n",
      "Generated mol 500\n",
      "Generated mol 600\n",
      "Generated mol 700\n",
      "Generated mol 800\n",
      "Generated mol 900\n",
      "Generation done\n",
      "Number of generated SMILES: 1000\n"
     ]
    }
   ],
   "source": [
    "# Setup model and generate molecules\n",
    "model = DenseGGNNChemModel(args)\n",
    "model.train()\n",
    "# Free up some memory\n",
    "model = ''"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assess generated decoys\n",
    "\n",
    "Now we need to select a final set of decoys from the candidate decoys.\n",
    "\n",
    "We will select 20 decoys per active."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Processing:  P38-alpha_example_decoys.smi\n"
     ]
    }
   ],
   "source": [
    "chosen_properties = \"ALL\"\n",
    "num_decoys_per_active = 20\n",
    "\n",
    "results = select_and_evaluate_decoys('P38-alpha_example_decoys.smi', file_loc='./', output_loc='./', \n",
    "                                     dataset=chosen_properties, num_cand_dec_per_act=num_decoys_per_active*2, num_dec_per_act=num_decoys_per_active)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following results are calculated and contained in ```results```:\n",
    "- File name - Name of input file\n",
    "- Chosen properties - Name of the property set chosen\n",
    "- Number of actives in input file\n",
    "- Number of actives after applying the minimum size filter\n",
    "- Number of candidate decoys\n",
    "- Number of unique candidate decoys\n",
    "- AUC ROC - 1NN - Performance as measured by AUC ROC of 1-nearest neighbour (1NN) algorithm in 10-fold cross-validation using all of the chosen properties\n",
    "- AUC ROC - RF - Performance as measured by AUC ROC of random forest (RF) algorithm in 10-fold cross-validation using all of the chosen properties,\n",
    "- DOE score - Deviation from Optimal Embedding score, a measure of property matching\n",
    "- LADS score - Latent Active in Decoy Set score\n",
    "- Average Doppelganger score - A measure of the structural similarity between actives and decoys\n",
    "- Maximum Doppelganger score - A measure of the structural similarity between actives and decoys\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DOE score: \t\t\t0.076\n",
      "Average Doppelganger score: \t0.210\n",
      "Max Doppelganger score: \t0.243\n"
     ]
    }
   ],
   "source": [
    "print(\"DOE score: \\t\\t\\t%.3f\" % results[8])\n",
    "print(\"Average Doppelganger score: \\t%.3f\" % results[10])\n",
    "print(\"Max Doppelganger score: \\t%.3f\" % results[11])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
